{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "95f0a171",
   "metadata": {},
   "source": [
    "(rectangling)=\n",
    "# Nested Data\n",
    "\n",
    "## Introduction\n",
    "\n",
    "In this chapter, you'll learn about **nested data**, working with data that is fundamentally tree-like and (often) converting it into a rectangular data frames made up of rows and columns. This is important because nested data is surprisingly common, especially when working with data that comes from a web API (such as you will see in {ref}`webscraping-and-apis`).\n",
    "\n",
    "To learn about rectangling, you'll first learn about lists, dictionaries, and the JSON format, as these are the data structures that are most often used to work with hierarchical data in Python. Then you'll learn about some functions that can help you turn hierarchical data into 'tidy' data in columns and rows. We'll then show you a few case studies, applying these simple function multiple times to solve real complex problems.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51a55374",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "# remove cell\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib_inline.backend_inline\n",
    "\n",
    "# Plot settings\n",
    "plt.style.use(\"https://github.com/aeturrell/python4DS/raw/main/plot_style.txt\")\n",
    "matplotlib_inline.backend_inline.set_matplotlib_formats(\"svg\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17575f3a",
   "metadata": {},
   "source": [
    "### Prerequisites\n",
    "\n",
    "This chapter will use the **pandas** data analysis package."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2fcd8093",
   "metadata": {},
   "source": [
    "## Lists\n",
    "\n",
    "Lists are a really useful way to work with lots of data at once. They're defined with square brackets, with entries separated by commas. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d1c3fa4",
   "metadata": {},
   "outputs": [],
   "source": [
    "list_example = [10, 1.23, \"like this\", True, None]\n",
    "print(list_example)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29b10d07",
   "metadata": {},
   "source": [
    "You can also construct them by appending entries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96c4b86b",
   "metadata": {},
   "outputs": [],
   "source": [
    "list_example.append(\"one more entry\")\n",
    "print(list_example)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8d4f6ed",
   "metadata": {},
   "source": [
    "And you can access earlier entries using an index, which begins at 0 and ends at one less than the length of the list (this is the convention in many programming languages). For instance, to print specific entries at the start, using `0`, and end, using `-1`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "138ac0e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(list_example[0])\n",
    "print(list_example[-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "920d19f4",
   "metadata": {},
   "source": [
    "```{admonition} Exercise\n",
    "How might you access the penultimate entry in a list object if you didn't know how many elements it had?\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6aea9157",
   "metadata": {},
   "source": [
    "As well as accessing positions in lists using indexing, you can use *slices* on lists. This uses the colon character, `:`, to stand in for 'from the beginning' or 'until the end' (when only appearing once). For instance, to print just the last two entries, we would use the index `-2:` to mean from the second-to-last onwards. Here are two distinct examples: getting the first three and last three entries to be successively printed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef390263",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(list_example[:3])\n",
    "print(list_example[-3:])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c82b5c4a",
   "metadata": {},
   "source": [
    "Slicing can be even more elaborate than that because we can jump entries using a second colon. Here's a full example that begins at the second entry (remember the index starts at 0), runs up until the second-to-last entry (exclusive), and jumps every other entry inbetween (range just produces a list of integers from the value to one less than the last):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13584953",
   "metadata": {},
   "outputs": [],
   "source": [
    "list_of_numbers = list(range(1, 11))\n",
    "start = 1\n",
    "stop = -1\n",
    "step = 2\n",
    "print(list_of_numbers[start:stop:step])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "813e09bc",
   "metadata": {},
   "source": [
    "A handy trick is that you can print a reversed list entirely using double colons:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2bc8926",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(list_of_numbers[::-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39f80c9e",
   "metadata": {},
   "source": [
    "````{admonition} Exercise\n",
    "Slice the `list_example` from earlier to get only the first five entries.\n",
    "````"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6ff3ca4",
   "metadata": {},
   "source": [
    "What's amazing about lists is that they can hold any type, including other lists! Here's a valid example of a list that's got a lot going on:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "703fb0b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "wacky_list = [\n",
    "    3.1415,\n",
    "    16,\n",
    "    [\"five\", 4, 3],\n",
    "    (91, 93, 90),\n",
    "    \"Hello World!\",\n",
    "    True,\n",
    "    None,\n",
    "    {\"key\": \"value\", \"key2\": \"value2\"},\n",
    "]\n",
    "wacky_list"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7669396e",
   "metadata": {},
   "source": [
    "### Hierarchical Data in Lists\n",
    "\n",
    "Because lists can contain more lists (and so on), they can be used to put hierachical data in. Let's take a look at an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28fe13d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "multilayer_list = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]\n",
    "multilayer_list"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57a81b53",
   "metadata": {},
   "source": [
    "Now, say we wanted to reduce this to a single list. We can do it with a *list comprehension*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73a71803",
   "metadata": {},
   "outputs": [],
   "source": [
    "[x for little_list in multilayer_list for x in little_list]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e96185a",
   "metadata": {},
   "source": [
    "What we're saying here is take all of the values of every little list and put them into a single list."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2d41125",
   "metadata": {},
   "source": [
    "### From Lists to Data Frames\n",
    "\n",
    "Occassionally, you'll have data in lists that you wish to turn into a data frame. For example, perhaps you have a list of lists like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "007dadfb",
   "metadata": {},
   "outputs": [],
   "source": [
    "list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fcfc2d3c",
   "metadata": {},
   "source": [
    "You can pass this straight into a constructor for a data frame as the `data=` keyword argument (adding in other info as necessary). Note that this is four lists of three entries, so the inner loop has entries in 0 to 2... it is this inner loop that will be used as the *rows* of any data frame with the number of entries in each inner list equal to the number of *columns*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aeb70f97",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "pd.DataFrame(data=list_of_lists, columns=[\"a\", \"b\", \"c\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc797c89",
   "metadata": {},
   "source": [
    "There's one more trick to show you: explode. This is useful when you have data that has more than one level of list depth. Let's say you read in some data with a complex hierarchical structure like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa1d828c",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"alpha\": [[0, 1, 2], \"foo\", [], [3, 4]],\n",
    "        \"beta\": 1,\n",
    "        \"gamma\": [[\"a\", \"b\", \"c\"], pd.NA, [], [\"d\", \"e\"]],\n",
    "    }\n",
    ")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91bb97aa",
   "metadata": {},
   "source": [
    "We have multiple rows and columns that contain lists. In some situations, it's fine to have a list in a column but here it's probably not as it's mixed in with other types of data. We can use `explode()` to split out the columns further length-wise"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0854bcf3",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.explode(\"alpha\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7481d43b",
   "metadata": {},
   "source": [
    "## JSON (Java Script Object Notation)\n",
    "\n",
    "No discussion of programming and hierarchical data objects would be complete without mentioning JSON (Java Script Object Notation)! You will run into this text data format all of the time when dealing with data from the web, especially from APIs (automated web-based data services). JSON represents objects as name/value pairs, just like a Python dictionary (these have the format `{key1: value1, key2: value2}`).\n",
    "\n",
    "The table below compares the different data types found in Python and JSON.\n",
    "\n",
    "| JSON OBJECT   | PYTHON OBJECT |\n",
    "|---------------|---------------|\n",
    "| object        | dict          |\n",
    "| array         | list          |\n",
    "| string        | str           |\n",
    "| null          | None          |\n",
    "| number (int)  | int           |\n",
    "| number (real) | float         |\n",
    "| true          | True          |\n",
    "| false         | False         |\n",
    "\n",
    "There are typically two operations you may want to do with JSON data: 1) turn JSON data in a Python object (eg JSON to Python dictionary) or vice versa (known as deserialisation and serialisation respectively); and 2) converting a deserialised object into a *different* kind of Python object.\n",
    "\n",
    "Let's look at each in turn."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a72b5732",
   "metadata": {},
   "source": [
    "### Reading in JSON data\n",
    "\n",
    "Let's look at a typical example of reading in some JSON data.\n",
    "\n",
    "#### From the Web\n",
    "\n",
    "We'll get some JSON data from an API. Let's grab the latest UK unemployment data (timeseries code \"MGSX\" and dataset code \"LMS\")."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "285ade5e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "url = \"https://api.beta.ons.gov.uk/v1/data?uri=employmentandlabourmarket/peoplenotinwork/unemployment/timeseries/mgsx/lms/previous/v106\"\n",
    "\n",
    "# Get the data from the ONS API:\n",
    "json_data = requests.get(url).json()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "051d3b4a",
   "metadata": {},
   "source": [
    "Let's check what type we got:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75399bf0",
   "metadata": {},
   "outputs": [],
   "source": [
    "type(json_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22424c07",
   "metadata": {},
   "source": [
    "As expected, the JSON data has automatically been read in as a dictionary—but be wary that the fields have been read in as text rather than numbers, datetimes, and other specific data types.\n",
    "\n",
    "We could print the whole object out but that would take up a lot of space; instead let's look at a couple of entries under the \"months\" key."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8cc4f408",
   "metadata": {},
   "outputs": [],
   "source": [
    "json_data[\"months\"][:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce713def",
   "metadata": {},
   "source": [
    "#### From a File or Stream\n",
    "\n",
    "For this exercise, you'll need to download the JSON file 'cakes.json' from the [data folder of the repository](https://github.com/aeturrell/python4DS/tree/main/data) associated with this book and save it in a sub-folder called \"data\". We can take a peek at the data using the terminal (which is what the preceeding exclamation mark means):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05069731",
   "metadata": {},
   "outputs": [],
   "source": [
    "json_string = \"\"\"\n",
    "{\n",
    " \"food\": \"doughnut\",\n",
    " \"good_with\": [\"coffee\", \"tea\"],\n",
    " \"flavour\": null,\n",
    " \"toppings\": [{\"id\": 0, \"type\": \"glazed\"},\n",
    "              {\"id\": 1, \"type\": \"sugar\"}]\n",
    "}\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c664ab6",
   "metadata": {},
   "source": [
    "We use the built-in **json** library to read this into Python (you could also use a file path here—more on how in a moment):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eca7982f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "result = json.loads(json_string)\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df41f92b",
   "metadata": {},
   "source": [
    "Note that not everything is the same in going from JSON text to a Python dictionary: JSON uses `null` rather than `None`, won't accept trailing commas at the end of lists, and has basic types that are lists, strings (and all keys must be strings), numbers, booleans, and nulls. Let's now see how to write a Python dictionary back to a JSON, perhaps for writing to file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea6f887d",
   "metadata": {},
   "outputs": [],
   "source": [
    "json_stream = json.dumps(result)\n",
    "json_stream"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f9445b8",
   "metadata": {},
   "source": [
    "To write to a file, you would use the pattern:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ba2facd",
   "metadata": {},
   "source": [
    "```python\n",
    "with open('data/json_data_output.json', 'w') as outfile:\n",
    "    json.dump(json_stream, outfile)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed0fd646",
   "metadata": {},
   "source": [
    "To read a file from disk, for example \"data/json_data_output.json\", it's\n",
    "\n",
    "```python\n",
    "json.load(open(\"data/json_data_output.json\"))\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e10374c",
   "metadata": {},
   "source": [
    "### From JSON data to Data Frame\n",
    "\n",
    "**pandas** has lots of options for turning JSON or dictionary data into a data frame. You do need to think a little bit about the structure of the data underneath though:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1dbfd7c3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "pd.DataFrame(result[\"toppings\"], columns=[\"id\", \"type\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1346020",
   "metadata": {},
   "source": [
    "The web-scraped data we downloaded earlier had a more complicated structure, but **pandas** has a `json_normalize()` function that can cope with this. For example, with the following data, there are many missing entries but `json_normalize()` can still parse it into a Data Frame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3ea0e0b",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = [\n",
    "    {\"id\": 1, \"name\": {\"first\": \"Coleen\", \"last\": \"Volk\"}},\n",
    "    {\"name\": {\"given\": \"Mark\", \"family\": \"Regner\"}},\n",
    "    {\"id\": 2, \"name\": \"Faye Raker\"},\n",
    "]\n",
    "pd.json_normalize(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7eaf00e1",
   "metadata": {},
   "source": [
    "And we can control the level that properties like 'name' are split out to as well (you can check out more options over at the [**pandas** documentation](https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9349a3a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.json_normalize(data, max_level=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78d637e5",
   "metadata": {},
   "source": [
    "As well as the JSON normalise function, **pandas** has a `from_dict()` method to work with simpler dictionary objects."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "encoding": "# -*- coding: utf-8 -*-",
   "formats": "md:myst",
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  },
  "toc-showtags": true
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
